Spotify
2023-03-30
Introduction
Many people around the world listen to music on Spotify. Spotify and third parties collect data on which songs are most popular on a given day and can analyze characteristics of those songs. Like the data scientists at Spotify, our STAT 240 group wanted to analyze a characteristic of popular songs on Spotify and apply the statistical methods we have learned in class.
Our Question: Is there a significant difference between the proportion of explicit songs in the Top 1 to 50 vs Top 51 to 100 in the Billboard top 100 playlist in Spotify?
The motivation behind this question is simple. We wanted to see whether explicit songs are more common in the top 50 than in the bottom 50 of the top 100. Music nowadays seems to be more explicit than ever, so we thought a difference might be noticeable.
This analysis tests whether there is a statistically significant difference, at the alpha=0.05 level (with a corresponding 95% confidence interval), between the proportion of explicit songs in the Top 1-50 vs Top 51-100 in the Billboard top 100 playlist in Spotify. The authors hope to find a significant difference between the proportions using a 2-sample z-test for difference in proportions.
Background
The dataset is the Billboard top 100 songs from March 30th, 2023. The playlist is shown below.
The playlist was exported from Spotify using a tool called Exportify, which returns a CSV of the playlist.
The CSV includes the song name, URL, artist name, album name, whether the song is explicit, etc. in a tidy format.
Note: Spotify determines if a song is explicit by examining whether a song contains one or more offensive or inappropriate words in its lyrics.
Billboard Top 100 is determined by a combined measure of both sales and airplay. Sales figures are provided by retailers such as iTunes and Amazon, while Nielsen BDS tracks radio airplay.
For the rest of the report, we will find the proportions of explicit songs in the top 1-50 and top 51-100 songs in the Billboard Top 100 playlist. We will then carry out a 2-sample z-test for a difference in proportions and a simulation to see if there is a statistically significant difference between the proportions of explicit songs in the 2 groups.
We assume that the Billboard top 100 songs on March 30th, 2023 behave like a random sample from the populations of interest. Namely, we assume that the explicit or non-explicit nature of the top 50 songs on 3/30/2023 is representative of the explicit or non-explicit nature of all top 50 songs in the Billboard top 100 list for all time. Likewise, we assume that the explicit or non-explicit nature of the top 51-100 songs on 3/30/2023 is representative of the explicit or non-explicit nature of all top 51-100 songs in the Billboard top 100 list for all time.
Analysis
| p_hat1 | p_hat2 | diff |
|---|---|---|
| 0.38 | 0.32 | 0.06 |
The proportion of songs that are explicit in the Top 1-50 is 0.38, represented by p_hat1.
The proportion of songs that are explicit in the Top 51-100 is 0.32, represented by p_hat2.
The difference between the proportions is 0.06, as shown by diff.
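The arithmetic behind these three numbers can be sketched in a few lines. This is only an illustration in Python (the report itself shows no code); the counts 19 and 16 out of 50 are the ones given later in the report, and the variable names are our own.

```python
# Counts taken from the report: explicit songs out of 50 in each half of the chart.
explicit_top, n_top = 19, 50        # higher-ranked half of the playlist
explicit_bottom, n_bottom = 16, 50  # ranks 51-100

p_hat1 = explicit_top / n_top
p_hat2 = explicit_bottom / n_bottom
diff = p_hat1 - p_hat2

print(round(p_hat1, 2), round(p_hat2, 2), round(diff, 2))  # prints: 0.38 0.32 0.06
```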
Is this difference significant? Let’s consider multiple approaches.
Statistical Model
The statistical model is:
- \(p_1\) is the true probability that a song in the Top 1-50 is explicit
- \(p_2\) is the true probability that a song in the Top 51-100 is explicit
\[ X_1 \mid p_1 \sim \text{Binomial}(50,p_1) \\ X_2 \mid p_2 \sim \text{Binomial}(50,p_2) \]
Confidence Interval from Simulation
Using p_hat1 and p_hat2 from above as estimates of the unknown \(p_1\) and \(p_2\), along with the size of each sample (50 songs), we will simulate many independent pairs of samples and compute the difference in sample proportions for each pair.
We will then calculate the standard deviation of these simulated differences. Because this is a simulation, we will use that standard deviation as our standard error.
| SE Simulation |
|---|
| 0.0951949 |
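One way such a simulation could look is sketched below. It is a minimal Python illustration, not the code used for the report: the function name `simulate_se`, the seed, and the reduced number of replications (20,000, chosen so that pure Python runs quickly) are all assumptions made here.

```python
import random
import statistics

def simulate_se(p1, p2, n1, n2, reps=20_000, seed=240):
    """Standard deviation of simulated differences in sample proportions."""
    rng = random.Random(seed)
    diffs = []
    for _ in range(reps):
        # One binomial draw per group, built from n independent Bernoulli trials.
        x1 = sum(rng.random() < p1 for _ in range(n1))
        x2 = sum(rng.random() < p2 for _ in range(n2))
        diffs.append(x1 / n1 - x2 / n2)
    return statistics.stdev(diffs)

# Plug in the sample proportions as stand-ins for the unknown p_1 and p_2.
se_sim = simulate_se(0.38, 0.32, 50, 50)
print(se_sim)
```

With enough replications the result should land close to the simulated standard error in the table, though the exact digits depend on the random seed.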
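- We can then use this standard error to finish the 95% confidence interval, which is given by the formula \[ (\hat{p}_1 - \hat{p}_2) \pm z \times \text{SE} \] where \(z\) is the 0.975 quantile of the standard normal distribution.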
| Point_estimate | SE | z | Low | High |
|---|---|---|---|---|
| 0.06 | 0.0951949 | 1.959964 | -0.1265785 | 0.2465785 |
- We are 95% confident that the difference in the proportion of explicit songs (Top 1-50 minus Top 51-100) is between -0.127 and 0.247. In other words, the true proportion could be anywhere from about 0.127 higher in the Top 51-100 to about 0.247 higher in the Top 1-50.
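The interval in the table can be reproduced from the point estimate and the simulated standard error. The sketch below uses Python's standard library; the helper name `wald_interval` is hypothetical, and the inputs are copied from the table above.

```python
from statistics import NormalDist

def wald_interval(point_estimate, se, confidence=0.95):
    """Return (low, high) for point_estimate +/- z * se."""
    # z is the quantile that leaves (1 - confidence) / 2 in each tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return point_estimate - z * se, point_estimate + z * se

low, high = wald_interval(0.06, 0.0951949)
print(low, high)
```

Because the interval contains 0, it is consistent with there being no difference between the two proportions.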
Confidence Interval using formula
- Instead of doing a simulation we can use this formula to get the standard error:
\[ \text{SE}(\hat{p}_1 - \hat{p}_2) = \sqrt{ \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2} } \]
- We then use the same confidence interval formula, but instead of the simulated SE we use the SE calculated from the formula above, with \(\hat{p}_1\) and \(\hat{p}_2\) substituted for the unknown \(p_1\) and \(p_2\)
| se | Point_estimate | z | Low | High |
|---|---|---|---|---|
| 0.095205 | 0.06 | 1.959964 | -0.1265985 | 0.2465985 |
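As a check on the table, the formula-based standard error and interval can be computed directly. This is a Python sketch under the assumption that the sample proportions 0.38 and 0.32 are plugged in for \(p_1\) and \(p_2\); the function name `se_diff` is our own.

```python
import math
from statistics import NormalDist

def se_diff(p1, n1, p2, n2):
    """Standard error of the difference between two independent sample proportions."""
    return math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)

se = se_diff(0.38, 50, 0.32, 50)
z = NormalDist().inv_cdf(0.975)  # two-sided 95% interval
low, high = 0.06 - z * se, 0.06 + z * se
print(se, low, high)
```

The formula-based value is nearly identical to the simulated one, which is what we would expect if the simulation is doing its job.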
Hypothesis Tests for Testing Differences
Inference question
- Is there a statistically significant difference between the proportion of explicit songs in the Top 1 to 50 vs Top 51 to 100 in the Billboard top 100 playlist in Spotify?
| Group | Explicit Songs | Total Songs | p_hat |
|---|---|---|---|
| Top 1-50 | 19 | 50 | 0.38 |
| Top 51-100 | 16 | 50 | 0.32 |
In the Billboard Top 100 playlist, 38% of the Top 1-50 songs were explicit, compared to 32% of the Top 51-100 songs.
We wish to explore whether there is a statistically significant difference between the proportions of explicit songs in the Top 1-50 and the Top 51-100.
Testing from Simulation
Statistical Model
The statistical model is:
- \(p_1\) is the probability that a song is explicit in the Top 1-50
- \(p_2\) is the probability that a song is explicit in the Top 51-100
\[ X_1 \mid p_1 \sim \text{Binomial}(50,p_1) \\ X_2 \mid p_2 \sim \text{Binomial}(50,p_2) \]
State hypotheses:
\[ H_0: p_1 = p_2 \\ H_a: p_1 \neq p_2 \]
- The null hypothesis is that there is no difference in the proportion of explicit songs in the Top 1-50 vs Top 51-100
- The alternative hypothesis is that there is a difference in the proportion of explicit songs in the Top 1-50 vs Top 51-100
We will pick our significance level alpha=0.05 because this is the conventional level of significance in statistics
Calculating a test statistic:
Our test statistic is simply the difference in sample proportions, \(\hat{p}_1 - \hat{p}_2\).
If the null hypothesis is true, then this statistic is expected to be close to zero, with any departure from zero caused by random sampling variation.
However, if the null hypothesis is false, then this statistic should differ from zero by more than random fluctuations alone would explain.
| test_stat |
|---|
| 0.06 |
Determining the null sampling distribution of the test statistic
If the null hypothesis is true, then \(p_1 = p_2\), and the null distribution of the test statistic is the one obtained when \(X_1\) and \(X_2\) are drawn with the same success probability \(p\).
To estimate \(p\) we combine both samples from our data: \[ \bar{p} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{19 + 16}{50 + 50} = 0.35 \]
| p0 |
|---|
| 0.35 |
Calculating the p-value via simulation
- The number of replications will be \(1,000,000\).
- We will generate binomial random variables \(X_1\) and \(X_2\), each with \(n = 50\) and success probability \(\bar{p} = 0.35\).
- Then we will find the difference between the resulting sample proportions.
- Finally, we will calculate a p-value as the proportion of simulated differences that are at least as extreme as the one from the original data. Because our alternative hypothesis \(H_a: p_1 \neq p_2\) is two-sided, extreme differences in either direction count.
| p value from simulation |
|---|
| 0.600682 |
- The p value 0.600682 indicates that we fail to reject the null hypothesis because the p value is greater than 0.05.
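The simulation steps described above could be sketched as follows. This is an illustrative Python version rather than the report's own code: the function name, the seed, and the smaller replication count (20,000 instead of 1,000,000, to keep pure Python fast) are assumptions. "As extreme" is checked on integer counts rather than on floating-point proportions, so that ties with the observed difference are counted reliably.

```python
import random

def simulated_p_value(x1_obs, n1, x2_obs, n2, reps=20_000, seed=240):
    """Two-sided p-value for a difference in proportions under H0: p1 = p2."""
    p0 = (x1_obs + x2_obs) / (n1 + n2)  # pooled estimate of the common p
    # |x1/n1 - x2/n2| compared exactly via integer cross-multiplication.
    observed = abs(x1_obs * n2 - x2_obs * n1)
    rng = random.Random(seed)
    extreme = 0
    for _ in range(reps):
        x1 = sum(rng.random() < p0 for _ in range(n1))
        x2 = sum(rng.random() < p0 for _ in range(n2))
        if abs(x1 * n2 - x2 * n1) >= observed:
            extreme += 1
    return extreme / reps

p_sim = simulated_p_value(19, 50, 16, 50)
print(p_sim)
```

The estimate should fall near the reported simulated p-value, up to Monte Carlo error from the smaller number of replications.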
Normal approximation for p-value
- Let’s also use a theoretical approach based on an equation. We will use a z-test for difference in proportions to derive a p-value.
\[ z = \frac{(\hat{p}_1 - \hat{p}_2) - 0}{\text{SE}} \]
We are able to use this normal approximation because the sample sizes are large enough (\(n=50\)) and the estimated \(p\) is not close to either 0 or 1. With \(n = 50\) and \(\bar{p} = 0.35\), \(np(1-p) = 11.375 > 10\), and np(1-p) > 10 is the rule of thumb we use for when it is appropriate to apply the normal approximation.
Under the null hypothesis both groups share a common success probability \(p\), so the standard error of the difference between the 2 sample proportions is:
\[ \text{SE}(\hat{p}_1 - \hat{p}_2) = \sqrt{ \frac{p(1-p)}{n_1} + \frac{p(1-p)}{n_2} } \]
To estimate \(p\) we combine both samples from our data using the formula below \[ \bar{p} = \frac{X_1 + X_2}{n_1 + n_2} = \frac{19 + 16}{50 + 50} = 0.35 \]
First let’s find the z statistic
| z statistic |
|---|
| 0.6289709 |
- Using this z statistic we can get the p-value which is twice the area to the right of z under the standard normal curve.
| p value from z statistic |
|---|
| 0.5293681 |
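The z statistic and p-value above can be reproduced with a short calculation. The Python sketch below assumes the standard error is computed from the pooled proportion 0.35, an assumption that matches the reported z statistic; the function name `two_proportion_z_test` is our own.

```python
import math
from statistics import NormalDist

def two_proportion_z_test(x1, n1, x2, n2):
    """Two-sided z-test for H0: p1 = p2 using the pooled standard error."""
    p_pool = (x1 + x2) / (n1 + n2)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n1 + 1 / n2))
    z = (x1 / n1 - x2 / n2) / se
    # Twice the upper-tail area beyond |z| under the standard normal curve.
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value

z, p_value = two_proportion_z_test(19, 50, 16, 50)
print(z, p_value)
```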
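- The simulation and the normal approximation give p-values of about 0.60 and 0.53, respectively. They are not identical, which is plausible because the simulated statistic can only take a discrete set of values while the normal curve is continuous, but both are interpreted in the same way (i.e. we fail to reject the null hypothesis at the alpha=0.05 significance level)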
Graphical representation
- The orange dotted line (the observed test statistic) would need to be in one of the purple shaded areas (the rejection regions) in order for this test to have given us statistically significant results (2 proportion z test to compare the true proportions in two populations)
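The same comparison can be made numerically. The sketch below, in Python, assumes the shaded areas correspond to the two-sided rejection region at alpha=0.05 and uses the z statistic reported earlier; the variable names are hypothetical.

```python
from statistics import NormalDist

alpha = 0.05
z_observed = 0.6289709  # z statistic reported in the table above

# Critical value: |z| must exceed this to land in either tail of the rejection region.
z_critical = NormalDist().inv_cdf(1 - alpha / 2)
in_rejection_region = abs(z_observed) > z_critical

print(z_critical, in_rejection_region)
```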
Discussion
We did not find evidence of a statistically significant difference, at the alpha=0.05 level, between the true proportion of explicit songs in the top 1-50 and the true proportion in the top 51-100 songs on Spotify.
It is possible that we made a Type II error by failing to reject a null hypothesis that is actually false. This outcome is possible because we had a small sample size, and the true difference in proportions of explicit songs in the top 1-50 versus the top 51-100 may be small but nonzero.
For future work, we propose that the relationship between the explicit songs and time be explored. We propose fitting a linear regression model to predict the proportion of explicit songs in the top 100 songs over time. We hypothesize that as the music industry has evolved, this proportion has fluctuated and believe it would be interesting to explore the relationship between time and the percent of explicit songs.
Additionally, we propose a stronger test be conducted to further examine the hypothesis of a difference between the percent of explicit songs in the top 1-50 compared to the percent of explicit songs in the top 51-100. We suggest that a random sample of days in the past three years be gathered, and for those days, the percent of explicit songs in the top 1-50 versus the top 51-100 be computed. Then, we propose an additional significance test be conducted to determine if there is truly a difference between the two populations. This larger dataset would help detect the difference if there is merely a small one that was not statistically significant in our test, which uses only 100 data points and therefore is not a powerful test for detecting slight differences between two populations.
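To give a rough sense of how sample size affects the ability to detect a small difference, the sketch below approximates the power of a two-sided z-test. It is a back-of-the-envelope Python illustration under strong assumptions that are ours, not results from the report: the true proportions are taken to equal the observed 0.38 and 0.32, observations are treated as independent (songs on nearby days would not be), and the function name `approx_power` and the comparison size of 1,000 per group are hypothetical.

```python
import math
from statistics import NormalDist

def approx_power(p1, p2, n, alpha=0.05):
    """Approximate power of a two-sided two-proportion z-test with n per group."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    se = math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)
    shift = abs(p1 - p2) / se  # how many SEs the true difference sits from zero
    # Probability that the z statistic lands in either rejection tail.
    return std_normal.cdf(shift - z_crit) + std_normal.cdf(-shift - z_crit)

power_50 = approx_power(0.38, 0.32, 50)
power_1000 = approx_power(0.38, 0.32, 1000)
print(power_50, power_1000)
```

Under these assumptions, the power with 50 songs per group is far lower than with 1,000 per group, which is the intuition behind proposing a larger dataset.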
In summary, we fail to reject the null hypothesis. Namely, we do not have statistically significant evidence at the alpha=0.05 level that there is a difference between the true proportion of explicit songs in the top 1-50 compared to the true proportion of explicit songs in the top 51-100. Our theoretical and simulated confidence intervals for the difference in proportions include 0, and our theoretical and simulated significance tests have p values > 0.05.
Reference
Is there an Explicit Content filter? - The Spotify Commun. . .. (2023, January 28). https://community.spotify.com/t5/FAQs/Is-there-an-Explicit-Content-filter/ta-p/4631272
Spotify - About Spotify. (2023, March 9). Spotify. https://newsroom.spotify.com/company-info/
Trust, G. (2014, September 7). Billboard. Billboard. https://www.billboard.com/pro/ask-billboard-how-does-the-hot-100-work/
What Is The Billboard Hot 100? (2023, January 2). Edmsauce. Retrieved April 10, 2023, from https://www.edmsauce.com/what-is-the-billboard-hot-100/